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ABSTRACT 

Despite declining sequencing costs, few methods 
are available for cost-effective single-nucleotide 
polymorphism (SNP), insertion/deletion (INDEL) and 
copy number variation (CNV) discovery in a sin- 
gle assay. Commercially available methods require 
a high investment to a specific region and are only 
cost-effective for large samples. Here, we introduce 
a novel, flexible approach for multiplexed targeted 
sequencing and CNV analysis of large genomic re- 
gions called multiplexed direct genomic selection 
(MDiGS). MDiGS combines biotinylated bacterial ar- 
tificial chromosome (BAC) capture and multiplexed 
pooled capture for SNP/INDEL and CNV detection 
of 96 multiplexed samples on a single MiSeq run. 
MDiGS is advantageous over other methods for CNV 
detection because pooled sample capture and hy- 
bridization to large contiguous BAC baits reduces 
sample and probe hybridization variability inherent 
in other methods. We performed MDiGS capture for 
three chromosomal regions consisting of ^550 kb of 
coding and non-coding sequence with DNA from 253 
patients with congenital lower limb disorders. PITX1 
nonsense and H0XC11 S191F missense mutations 
were identified that segregate in clubfoot families. 
Using a novel pooled-capture reference strategy, we 
identified recurrent chromosome chr17q23.1q23.2 
duplications and small H0XC5' cluster deletions (51 
kb and 12 kb). Given the current interest in coding 
and non-coding variants in human disease, MDiGS 
fulfills a niche for comprehensive and low-cost eval- 



uation of CNVs, coding, and non-coding variants 
across candidate regions of interest. 

INTRODUCTION 

As candidate genes are identified for human diseases, there 
is an increasing need to sequence the regions of interest at 
high depth in larger populations. Although custom capture 
and amplicon-based sequencing methods are commercially 
available, they require a high initial investment in probe de- 
signs and only become cost-effective when amortized over 
large sample sizes that are often not available for rare dis- 
orders. Additionally, many methods focus on a single type 
of variant such as SNPs and INDELs and are poorly de- 
signed to detect copy number variations (CNVs), or are lim- 
ited to only the coding region. This presents a paradox for 
researchers because a significant financial investment must 
be made for a specific region and type of variation before 
having the evidence that these variants are relevant for the 
disorder in question. 

Despite declining sequencing costs and the development 
of novel targeted capture methods, even modestly sized se- 
quencing projects are still cost prohibitive for most investi- 
gators. Therefore, we set out to develop a low-cost method- 
ology that combines SNP/INDEL detection with CNV 
analysis across multiple coding and non-coding regions. 
The goal was to develop an approach that was scalable for 
both small and large studies, yet flexible enough to apply 
to candidate resequencing projects ranging from multiple 
large genomic regions to large multi-exonic genes that can 
be especially costly to sequence. To achieve this, we devel- 
oped a novel approach called multiplexed direct genomic 
selection (MDiGS), which utilizes a biotinylated bacterial 
artificial chromosome (BAC) based capture (1) combined 
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with multiplexed pooled capture of up to 96 samples (2) 
and next-generation sequencing for SNP/INDEL detection 
and highly accurate CNV analysis. BACs tile the human 
genome {n = 208 922) with an average size of 147 514 bp 
(25 001-349 971 bp range). These BACs cover >92% of 
the human genome, >99.4% of the consensus coding se- 
quence (CCDS), and are readily available for $80 each. Be- 
cause MDiGS avoids the costly investment required for cus- 
tom probe designs, MDiGS provides low-cost and compre- 
hensive alternative for candidate screening of large genes 
and gene clusters, analysis of extensive non-coding regions, 
analysis of linkage and GWAS loci for small sample sizes, 
or as a pilot screen before committing to a region of interest 
with alternative methods. 

MDiGS introduces a novel copy number analysis method 
that provides distinct advantages over currently available 
capture methods. Major limitations of other targeted cap- 
ture CNV detection methods are sample-to-sample vari- 
ability and probe hybridization variability, both of which 
are minimized in MDiGS. With many non-multiplexed 
methods, sample variability results from differences in pre- 
and post-capture library amplification and efficiency of in- 
dividual hybridization reactions, making it difficult to de- 
termine if coverage differences represent true CNVs. The 
use of small non-contiguous probes as capture baits is also 
problematic, since individual probes are inherently difficult 
to quantify precisely and equivalently in batches. Although 
several analysis methods have been proposed to improve 
CNV detection in non-contiguous targeted capture data (3- 
5), MDiGS addresses both of these challenges by perform- 
ing pre-capture DNA multiplex poohng and hybridization 
to large contiguous BAC baits. Since all amplification and 
hybridizations steps are performed as a pool, sample-to- 
sample variabihty is reduced allowing for a high signal-to- 
noise ratio. Because the BAC bait is prepared from a single 
linear stretch of DNA, region-to-region differences are re- 
duced, the Hkelihood of identifying reads spanning break- 
points is increased, and CNVs are more accurately detected. 

We applied MDiGS to a study of human lower limb dis- 
orders in which rare CNVs, SNP/INDELs and non-coding 
regions had previously been implicated. We captured and 
sequenced pools of up to 96 indexed samples for 554 kb 
of genomic sequence across three chromosomal regions as- 
sociated with lower limb disorders (5^ HOXC cluster (6), 
PITXl (7,8) and TBX4 (9)) on a single MiSeq run. Using 
MDiGS, we found comparable or better coverage of cod- 
ing and non-coding sequences compared to whole exome 
sequencing (WES) and whole genome sequencing (WGS), 
particularly in problematic GC-rich regions (10,11) com- 
pared to WGS. Sensitivity and specificity of the method was 
demonstrated by 100% concordance for all variant and ref- 
erence genotyping calls for samples sequenced with MDiGS 
and WES. CNV detection using MDiGS was compared to 
samples analyzed on the Affymetrix 6.0 platform and com- 
parable sensitivity and specificity for large CNVs (>100 
kb) and increased sensitivity for small CNVs (>12 kb) was 
demonstrated. MDiGS provides a powerful and flexible, 
low-cost method of candidate screening for SNP/INDELs 
and CNVs and is widely applicable for resequencing of large 
genes and multiple genomic regions. 



MATERIALS AND METHODS 

Patient samples and controls 

We identified 253 patients with idiopathic lower limb abnor- 
malities (clubfoot, vertical talus, tibial hemimelia and Poly- 
dactyly) from the Washington University Musculoskele- 
tal DNA Databank who were recruited from St Louis 
Children's Hospital and St Louis Shriners Hospital. Pa- 
tients were excluded from the study if they had addi- 
tional non-limb congenital anomalies, developmental de- 
lay or known underlying etiologies such as arthrogryposis, 
myelomeningocele or myopathy. DNA was collected from 
blood and saliva from affected probands and family mem- 
bers after obtaining informed consent. DNA extractions 
were performed using the manufacturer's protocols using ei- 
ther the DNA Isolation Kit for Mammalian Blood (Roche, 
Indianapolis, IN, USA) or the Oragene Purifier for saliva 
(DNA Genotek, Kanata, ON, Canada). 

Multiplexed direct genomic selection 

Enrichment of large genomic regions was performed on 
cases and controls using the MDiGS method. MDiGS 
uses a biotinylated BAC-based capture (1) combined 
with multiplexed pooled-capture (2) and next-generation 
sequencing; the protocol is described in detail in Sup- 
plementary Methods. Briefly, individual samples were 
indexed in batches of 96 using 150 ng of genomic DNA. 
Indexed samples were pooled in batches of 48 prior to BAC 
capture. BACs were obtained from BACPAC Resources 
(Oakland, Cahfornia), purified with NucleoBond BAC 100 
(Clonetech), biotinylated with the Nick Translation System 
(Invitrogen) and used to capture the following coding and 
non-coding genomic regions: FBNl (RP11-147E14 and 
RP11-1144G24 chrl5:48676783^8985158), FBN2 (RPll- 
909P14 and RP11-351A8 chr5:127567172-127906815), 
PITXl (RP11-17G15 chr5:134350801-134514499), HOXC 
cluster (RP11-578A18 chrl2:54225979-54410228) and 
TBX4 (RP11-905P16 chrl7:59468421-59674379). Ninety- 
six multiplexed post-capture samples were sequenced using 
a single 500 cycle MiSeq Personal Sequencer run (Illumina, 
San Diego, California). 

Exome and whole genome sequencing 

Exome enrichment was performed on DNA from controls 
(/7 = 121) using the SureSelect Human All Exon 50 MB kits 
(Agilent Technologies, Santa Clara, California). Ten cases 
with idiopathic lower limb abnormalities were selected for 
both exome and MDiGS capture to determine the concor- 
dance in variant calling and calculate the median absolute 
pairwise difference (MAPD) scores for WES and MDiGS. 
WGS was performed by Illumina on DNA from controls {n 
= 73) and sequenced to 30 x coverage. All WES and WGS 
samples were sequenced to >95% CCDS at >8x coverage. 

Next-generation sequencing and variant calling 

MDiGS reads were demultiplexed by index, allowing for 
1 bp mismatches (index sequence Hamming distance >2). 
Identical pipelines were used for MDiGS, WES and WGS 
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alignments and raw variant calling. Samples were aligned 
to hgl9 human reference sequence (NCBI 37.0) using 
Novoalign V2.08.02 (Novocraft Technologies, Selangor, 
Malaysia) using the parameters -r none -i 300 100 - 
a [indexed adapters]' and output in SAM format. Variant 
calling was performed using SAMtools vO. 1 . 18 mpileup for 
all capture regions with '-C50 -q 20 -Q20 -Eu' and piped 
to bcftools "view -bvcg". Raw reads were filtered for a read 
depth of 8 and a minimum of three variant reads using 
vcfutils 'varFilter -d 8 -a 3'. To correct for potential in- 
dex switching during pre- and post-capture amplification, 
MDiGS heterozygous calls with fewer than 20% of the reads 
matching the reference or variant allele were changed to the 
homozygous variant or wild-type genotype, respectively, as 
described previously (2). 



Copy number analysis 

Copy number analysis was first performed on MDiGS sam- 
ples using a custom approach to identify large deletions 
and duplications spanning the entire capture region. Poor- 
quality samples with total BAC coverage less than 80% 
at 8x were excluded from CNV analysis. For large CNV 
analysis, each individual sample's read counts mapping to 
each region were first normalized to the other two target 
regions to correct for sample loading efficiencies. Each nor- 
malized test sample and region was individually compared 
to (Method 1) the normalized read counts of two healthy 
controls or (Method 2) the average normalized read counts 
for all other test samples for each corresponding target re- 
gion from the same capture batch. Samples with large du- 
plications and deletions comprising the entire BAC length, 
defined as copy number (normalized test:avg normalized 
reference ratio)/0.5 > 2.5 or < 1.5 respectively, were eas- 
ily identified using this analysis. For small CNV analysis, 
breakpoints within the capture region were detected using 
CNV-seq using global normalization and 1500 bp window 
size (12). 

Copy number analysis was performed individually for 
each test sample against (Method 1) a known healthy con- 
trol reference or (Method 2) against 10 randomly selected 
test samples. Samples with large duplications or deletions 
identified in the first stage of the analysis were excluded as 
reference samples for the pooled reference analysis. Log2 ra- 
tios for each window were averaged across all pooled refer- 
ence comparisons for each test sample. CNVs were defined 
as log2 lb 0.5 and >10 kb and were vahdated by qPCR. 
One-end anchored (OEA) reads, reads with one mapped 
and one unmapped read, were used to identify breakpoints. 
OEA reads with one paired end mapped within 1 kb of pre- 
dicted breakpoints were identified for all samples and the 
unmapped paired ends were mapped using UCSC BLAT to 
identify exact breakpoints. 

MAPD scores were calculated as the median absolute 
log2 difference between the average base coverage for ad- 
jacent 500 bp windows or Agilent SureSelect Human All 
Exon 50 MB target regions within the BAC capture regions. 
A paired Student's t-test was used to calculate /?-values for 
MDiGS versus exome capture MAPD scores. 



Biotinylated BACs 96 indexed genomic libraries (150 ng) 



Hybridize Multiplexed DNA p^^i^^ Libraries 
With Pre-capture PCR 
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pool of 96 samples (MiSeq) 



Streptavidin bead capture 
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SNP/INDEL Analysis Copy Number Analysis 

Figure 1. Overview of multiplexed direct genomic selection (MDiGS) 
methodology. Three BACs were selected for use as baits and biotinylated 
for targeted capture of 554 kb of coding and non-coding sequence sur- 
rounding PITXl, 5' HOXC genes and TBX4. Genomic input DNA (150 
ng) was individually indexed and pooled in batches of 96. Multiplexed 
DNA pools were hybridized with biotinylated BAC baits and enriched tar- 
get regions are sequenced on MiSeq Personal Sequencer. 



RESULTS 

Multiplexed direct genomic selection 

We first identified BACs on three different chromosomes 
spanning the entire coding and flanking regulatory se- 
quence (554 kb) of regions previously associated with 
lower limb development: PITXl, TBX4, and the 5' HOXC 
genes and non-coding RNAs (HOXC8-HOXC13, HO- 
TAIR, H0XC-AS5 and MIR196A2). To determine the con- 
sistency of BAC capture against different target regions, we 
also identified BACs spanning two large multi-exonic genes 
on separate chromosomes that are especially costly to se- 
quence, FBNl (308 kb, 66 exons) and FBN2 (340 kb, 65 ex- 
ons) for separate experiments. BACs were individually bi- 
otinylated using the Nick Translation System (Invitrogen) 
and combined at equal molar ratios for capture of multiple 
target regions (1). 

While the initial direct genomic selection method de- 
scribed by Bashiardes et al required two BAC hybridiza- 
tion steps to sufficiently enrich for targeted regions, we in- 
troduced a non-biotin labeled off-target BAC as a simple 
repeat element blocking agent that allowed us to reduce the 
hybridization to a single step. Indexed libraries from 150 ng 
of genomic DNA were pooled in batches of 48 samples at 
equal concentrations (2) and hybridized with the biotiny- 
lated BACs (Figure 1). Enriched, multiplexed hbraries were 
then sequenced in pools of 96 indexed samples on a single 
MiSeq run. 

We first evaluated the effects of increasing the number of 
target regions on target enrichment. We performed MDiGS 
on pools of 48 samples captured for one, two or three re- 
gions and calculated the number of on-target reads for 
each capture pool (Figure 2A). To determine the techni- 
cal variability in BAC bait production, we performed each 
capture in duphcate using separately prepared BAC baits 
and achieved similar enrichments for each replicate. We 
observed higher on-target percentage as target regions in- 
creased (Figure 2), one region = 25.4% (±2.4), two regions 
= 39% (±3.5) and three regions = 40% (±2.9), however, 
on-target enrichment of individual regions decreased as ex- 
pected as the number of target regions increased. On-target 
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Figure 2. MDiGS Target Enrichment. Pre-capture multiplexed pools of 48 samples were captured for (A) one (FBNl), two (FBNl and FBN2) or three 
different chromosome regions (PITXl, HOXC and TBX4). Separately prepared BAC baits were used for each replicate. (B) On-target enrichment of three 
regions (PITXl, HOXC and TBX4) was consistent for each region and sample (12-14%). 



enrichment for each target region was consistent across all 
samples sequenced on a single MiSeq lane (Figure 2B) and 
we did not observe a difference in total on-target enrich- 
ment for 24 versus 48 sample pre-capture multiplexing (Sup- 
plementary Figure SI). 



MDiGS coverage compared to WGS and exome capture 

MDiGS pooled capture of 96 samples achieved on aver- 
age 97% (±1.9) at >8x coverage across three targeted re- 
gions (554 kb) (Supplementary Figure S2), similar to whole- 
genome sequenced (WGS) samples (^ = 73) sequenced to 
30 X coverage (Figure 3A). MDiGS coverage of coding se- 
quence was comparable to exome captured samples {n = 
121), with 90% at 8 X coverage, which is lower than the non- 
coding coverage because of the high GC content within sev- 
eral coding exons of HOXC and PITXl genes. Coverage 
was slightly reduced for both exome and MDiGS capture 
of GC-rich regions (GC > 60%), however, both capture 
methods performed better than WGS for GC-rich regions 
(Figure 3B). To determine concordance in variant detec- 
tion between MDiGS and exome sequencing, 10 samples se- 
quenced using both methods were compared. Overall, 100% 
concordance was noted for all polymorphic genotypes {n = 
120) and all genotypes (w = 161 961) called at positions cov- 
ered >8x by both methods. 



Large copy number analysis 

Since most targeted resequencing technologies utilize non- 
contiguous probes which can introduce region-to-region 
variability, we compared the median absolute pairwise dis- 
tance (MAPD) scores of 10 samples sequenced by ex- 
ome capture and MDiGS. MAPD scores are calculated 
as the median absolute log2 difference between neighbor- 
ing probes. We calculated the MAPD scores for all non- 
contiguous exome target regions within our BAC baits for 
both MDiGS and exome samples. Since MDiGS utilizes 
each BAC bait as a large contiguous probe, we also calcu- 
lated MAPD scores for 500 bp adjacent windows since this 
is the average size of exome probe targets. MDiGS region- 
to-region variability was significantly reduced compared to 
exome samples for non-contiguous exome target regions {p 
= 1.7 X 10~^) and further reduced using 500 bp adjacent 
windows (/? = 1.1 x 10~^) (Figure 4A). 

Most reference-based copy number analysis methods rely 
on global normalization to correct for sample loading dif- 
ferences (12), which could be particularly problematic for 
identifying large CNVs spanning an entire BAC captured 
by MDiGS. To determine if both large and small CNVs 
could be detected with MDiGS and standard normalization 
methods, we included cases with known Affymetrix 6.0 mi- 
croarray validated CNVs spanning an entire capture region 
as well as cases with CNV breakpoints within a capture re- 
gion. 

To determine if large duplications spanning an entire cap- 
tured target region (i.e. BAC) could be detected, MDiGS 
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Figure 3. MDiGS coverage of target regions. (A) Total coverage of BAC-targeted coding and non-coding regions were equivalent to whole genome se- 
quencing and coverage of coding regions were equivalent to exome-captured sequencing and superior to whole genome sequencing. (B) Overall coverage 
of GC-rich regions decreased as GC content increases in all three platforms (MDiGS, exome sequencing and whole genome sequencing). Both BAG- and 
exome-capture-based methods performed better than whole genome sequencing for high GC-rich genes. 
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Figure 4. MDiGS copy number analysis. (A) Median absolute pairwise difference (MAPD) scores were calculated as the log2 absolute difference between 
adjacent target regions for 10 samples enriched by exome capture and MDiGS to compare region-to-region variability of similar target regions. MDiGS 
MAPD scores were also calculated for 500 bp adjacent windows to demonstrate reduced region-to-region variability achieved by using long contiguous 
BAG baits. Large duplications spanning an entire BAG can be detected in (B) small sample sizes using Method 1 , which compares individual test samples 
to normal controls or (C) large sample sizes using Method 2. Simulated data demonstrate how individual test samples with large CNVs can be correctly 
identified within a multiplexed pool of 96 test samples with unknown CNV state provided that less than 36 samples contain similar CNVs. 



was performed on a pool of 96 DNA samples including 
three cases with known chromosome 17q23.1q23.2 dupli- 
cations (9). Data were initially analyzed with CNV-seq (12) 
that uses a binned coverage comparison of a single test 
sample compared to a single reference sample using global 
or chromosomal normalization. Unfortunately, because the 
chromosome 17q23.1q23.2 duplication represents >37% of 
the entire captured sequence, the chromosomal and global 
normalization methods utilized by CNV-seq effectively re- 
moved the signal across the capture region and the CNV-seq 
method therefore failed to detect all three known duplica- 
tions included in the MDiGS pool. 

Therefore, to identify CNVs with MDiGS, we developed 
two strategies that take advantage of pooled pre-capture 
and the presence of multiple large chromosomal target re- 
gions (BACs). Because pre- and post-capture amplifica- 
tions, bait hybridization and next-generation sequencing 
are performed as pooled reactions, MDiGS reduces the 
technical variability due to library amplification and cap- 
ture hybridizations that are inherent in individually cap- 
tured sample comparisons. Thus, MDiGS provides multiple 
internal capture and sequencing depth controls across each 
sample and contiguous capture region that can be used to 
improve CNV detection accuracy. Therefore, two alterna- 
tive and complementary methods described below were uti- 
lized for CNV detection of MDiGS data. 



Method 1 compares read counts of individual test sam- 
ples to negative controls and is ideal for smaller sample 
sizes. To identify CNVs spanning the entire target region, 
individual sample total on-target read counts for each BAC 
capture region were first compared to the on-target read 
counts for all other capture regions for the same sample: 
TVri = R\ / (R2 + ^3). Each test sample is then individu- 
ally compared to the averaged normalized read counts of 
the two normal controls for the same target regions. Using 
this method, we individually compared three positive con- 
trol samples with 17q23.1q23.2 duplications to two normal 
controls and accurately detected all three patients with du- 
plications (Figure 4B). 

Method 2 identifies rare CNVs from a multiplexed pool 
of MDiGS test samples and is ideal for larger sample sizes. 
Since comparison to controls is highly dependent on the 
quality of the control sample data and utilizes valuable sam- 
ple space within the multiplex pool that otherwise could 
be used for cases, we sought to determine if rare CNVs 
could be identified from a pool of test samples with un- 
known CNV states. We simulated a pooled reference anal- 
ysis of captured samples and calculated normahzed read 
counts for each test sample region compared to the averaged 
normalized read counts across all other test samples (^—1) 
from the same captured batch. Since additional co -captured 
samples could theoretically contain recurrent CNVs of the 
same region, increasing replicates of samples containing 
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whole chromosome 17q23.1q23.2 dupHcations versus nor- 
mal control samples (validated by Affymetrix genome-wide 
6.0 array) were simulated in a single capture analysis. In the 
simulation, copy number state was correctly called for all 
96 samples when less than 36 samples with chromosome 
17q23.1q23.2 duplications were included in a multiplexed 
pool (Figure 4C), suggesting that this method is robust for 
rare CNV detection. 

High-resolution copy number analysis 

We next determined whether small CNVs within the tar- 
get region could be identified and mapped to high res- 
olution using one-end anchored read alignments (OEA), 
which uses mapped reads within 1 kb of a predicted 
breakpoint to identify unmapped paired-ends spanning the 
CNV breakpoints (3,13). To test the sensitivity of small 
CNV detection, we included two samples with one chro- 
mosomal deletion breakpoint within the targeted capture 
regions at chrl2:54165001-54335668 (y HOXQ (6) or 
chr5:134194484-134435123 (PITXl) (7). CNV-seq analy- 
sis was performed using global normalization and a 1500 
bp window size for each sample compared to 10 randomly 
selected test samples from the same pool. Log2 differences 
for each bin were averaged across all reference comparisons 
to reduce background noise and further refine CNV break- 
points. Only bins with averaged log2 +/ — 0.5 were called for 
CNVs using CNV-seq Perl and R packages (12). 

Unlike the chromosome 17q23.1q23.2 duplications that 
were not detected using CNV-seq because they spanned the 
entire target region (see above), deletions that were par- 
tially contained within the target region were detected for 
both PITXl (Figure 5A) and HOXC (Figure 5C) deletions 
with CNV-seq using global normalization and pooled ref- 
erence analysis. An OEA read strategy was then used to 
identify exact breakpoints of CNVs both within and out- 
side the targeted capture regions (3,13). First, we identified 
all OEAs for which the mapped read was within 1 kb of a 
predicted breakpoint and UCSC BLAT was used to map 
the unmapped paired-end spanning the CNV breakpoint. 
Even though 5^ breakpoints were outside the targeted cap- 
ture region for both deletions, this analysis resulted in our 
identification of reads that overlapped both the 5^ and 3^ 
CNV breakpoints, allowing us to precisely map both CNV 
breakpoints for each deletion (Figure 5B and D). 

Novel SNP/INDEL and CNV detection in lower limb mal- 
formations 

To determine the frequency of SNP/INDELs and small 
CNV rare variants in PITXl, TBX4 and 5^ HOXC cluster 
in humans with lower limb disorders, we next performed 
MDiGS on a cohort of 253 patients with lower limb mal- 
formations and 168 controls. We identified and validated 13 
rare missense or small INDEL variants (MAF < 0.01) caus- 
ing protein coding changes in PITXl, H0XC13, H0XC12, 
HOXCll, HOXCIO, H0XC8 and TBX4 and two rare vari- 
ants in HOTAIR, a non-coding RNA located within the 
HOXC 5^ gene cluster that is not typically captured by com- 
mercial exome capture kits (Supplementary Table SI). Six 
of these rare variants causing protein coding changes were 



not previously described in public databases (dbSNP, 1000 
Genomes or EVS) or in the control sample sequence with 
MDiGS. These variants include a novel PITXl frameshift 
mutation and a HOXCll S191F missense mutation that 
both segregate with clubfoot (data not shown). 

To determine whether common variants near PITXl, 
TBX4 or HOXC genes are associated with lower limb ab- 
normalities, we selected 190 unrelated Caucasian cases and 
156 controls for a case/control association analysis using 
PLINK (14). We excluded rare variants (MAF < 0.05), vari- 
ants out of Hardy- Weinberg equilibrium (p < 0.001) and 
variants with missing calls in >10% of cases and controls. 
None of the common variants were significant after correc- 
tion for multiple testing. 

Because large CNVs involving PITXl, TBX4 and HOXC 
genes have previously been described in patients with lower 
limb malformations using Affymetrix 6.0 arrays that have 
a lower size limit of detection of approximately 100 kb 
in our prior studies (6), we were interested in determin- 
ing whether additional smaller CNVs involving these genes 
could be detected with MDiGS. Therefore, copy number 
analysis was performed on 253 MDiGS captured samples 
from patients with clubfoot and other lower limb disorders 
and 168 controls. To confirm the accuracy of CNV calls, 
three samples with known chromosome 17q23.1q23.2 du- 
plications were included in this capture. We first performed 
pooled reference analysis on all cases and controls in or- 
der to identify CNVs that span the entire target region. All 
three samples with known chromosome 17q23.1q23.2 du- 
plications and one additional new case with a chromosome 
17q23.1q23.2 duplication were identified (Figure 6A) and 
no CNVs were predicted in the control samples. The chro- 
mosome 17q23.1q23.2 duplication in this clubfoot case was 
vahdated by qPCR and segregated with an affected parent 
(Figure 6B and C). 

Samples with smaller CNVs were then identified using 
CNV-seq after excluding cases with large CNVs (detected 
by pooled CNV analysis) as reference samples. Two novel 
chromosomal microdeletions were identified in two sep- 
arate cases on chromosome 12 at the HOXC gene clus- 
ter (Figure 7 A and B). These include a 51 kb deletion 
(chrl2:5431 1194-54361982) overlapping the HOXC clus- 
ter identified in a patient with vertical talus and a 12.7 
kb deletion located 5^ of the HOXC gene cluster identi- 
fied in a patient with isolated clubfoot (chrl2:54303751- 
54316500). OEA read analysis was used to identify the pre- 
cise breakpoints as previously described (Figure 7C). This 
5 1 kb microdeletion had not previously been identified with 
Affymetrix 6.0 microarray analysis (data not shown) nor 
was it detected by analysis of WES data with CONTRA 
(15), a copy number analysis tool for targeted resequencing 
that was specifically designed for whole-exome capture data 
(Supplementary Figure S3), demonstrating the superiority 
of MDiGS for detection of small microdeletions. 

DISCUSSION 

MDiGS provides a flexible, low-cost alternative method for 
improved CNV detection and SNP/INDEL screening of 
variable sample sizes. Custom targeted capture with cur- 
rently available commercial products are suitable for large 
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sample sizes in which the individual sample cost is reduced 
by spreading the high cost of custom probes or primers over 
large sample sizes (16-18). However, for rare disorders, large 
sample sizes are not available, and the cost of gene rese- 
quencing per sample becomes prohibitively expensive. Fur- 
thermore, custom capture design locks an investigator into 
a specific experimental design, as the genes and regions to 
be sequenced need to be selected at the beginning of the ex- 
periment, and cannot be modified as samples are captured 
and sequenced, a process that may take several months to 
complete. In contrast, the MDiGS method described here 
utilizes BACs that are readily available at low cost ($80) for 
nearly all chromosomal regions and genes, making MDiGS 
ideal for candidate locus screening. 

MDiGS capture efficiency is similar to previously de- 
scribed pre-capture multiplexing methods (2,19). Although 
on-target percentage is reduced compared to post-capture 
multiplexing (20), enrichment is sufficient to cover >97% of 
target bases at >8x coverage for 554 kb of target sequence 
in 96 samples on a single MiSeq lane. Using only 150 ng of 



genomic DNA with our library prep protocol, pre-capture 
multiplexed DNA libraries can typically be used for up to 
10 subsequent captures with different targets, adding flex- 
ibility and reducing the cost of sequencing additional tar- 
get regions by eliminating the need for costly sample preps. 
BAC-based targeted selection is particularly suitable for re- 
sequencing of large genes with multiple exons since other 
amplicon- and probe-based approaches require individu- 
ally designed baits for each exon. Also, with new interest 
in the role of non-coding regions in human disease (21-23), 
our BAC-based approach comprehensively evaluates non- 
coding regulatory sequences at significantly lower cost than 
WGS. 

MDiGS is based on direct genomic selection (DiGS), a 
BAC-based capture method (1) that achieved 52% on-target 
capture using two BAC hybridization steps (24), although 
it did not utilize multiplexed patient DNA capture and was 
not optimized for CNV analysis. Our modifications of their 
method include use of a non-biotin labeled off-target BAC 
as a blocking agent to reduce hybridization to a single step. 
However, the most significant modification of their protocol 
is our utilization of the BAC-based method to capture mul- 
tiplexed DNA samples in a pooled capture that optimizes 
concurrent CNV analysis. In addition to cost advantages, 
we demonstrate that multiplexed BAC capture of pooled 
samples is highly accurate for large and small CNV and 
SNP/INDEL detection. The MDiGS protocol is now de- 
signed for 96 multiplexed samples and takes advantage of 
an Illumina multiplexed library preparation that requires 
only 150 ng of input genomic DNA (Supplementary Meth- 
ods). MDiGS can also be used with dual indexing to in- 
crease sample numbers or to reduce the potential risk of 
index switching that would be required for diagnostic se- 
quencing. With MDiGS, a single MiSeq run is sufficient to 
analyze 96 patient samples for up to 550 kb of sequence 
across multiple targeted genomic loci. Although we have 
not identified the upper limits of genomic sequence that can 
be covered in a single MiSeq run, enriched libraries can be 
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Figure 7. Small internal CNVs detected by MDiGS. (A and B) Two novel microdeletions located within the 5' HOXC gene cluster were identified in two 
patients with lower limb malformations, 8029001 (51 kb) and 5260003 (12 kb), using merged CNV-Seq analysis. Breakpoints were identified with (C) 
one-end anchored (OEA) read analysis. 



scaled up to HiSeq 2000 lanes for larger number of sam- 
ples or post-capture libraries can be pooled for increased 
targeted regions. 

A unique strength of MDiGS is the additional high- 
resolution and accurate CNV detection capabilities ob- 
tained from pooled capture in addition to the detection of 
SNPs and INDELs. Identification of a variety of genetic 
and genomic abnormalities using a single platform provides 
more comprehensive candidate gene analysis than is cur- 
rently available. Using MDiGS, we demonstrate two distinct 
strategies to reliably detect (i) common CNVs by compari- 
son to healthy reference samples, which is ideal for smaller 
sample sizes, and (ii) rare CNVs in a multiplexed pool of test 
samples with unknown CNV state. With MDiGS, smaller 
CNVs with one or both breakpoints within the capture re- 
gion are identifiable with available CNV detection software 
such as CNV-seq. We also demonstrated that using large 
contiguous BAC baits provides lower region-to-region vari- 
ability and higher sensitivity CNV detection compared to 
non-contiguous probes used for exome targeted capture. 
Furthermore, OEA reads can be used to map exact break- 
points with higher resolution than microarray-based CNV 
detection methods. 

Our application of the MDiGS method to candidate gene 
screening of 253 patients with lower limb malformations 
identified novel single nucleotide polymorphisms and novel 
copy number variants in PITXl, TBX4 and HOXC genes. 



confirming the importance of both CNVs and point muta- 
tions in the etiology of these genetically heterogeneous dis- 
orders. Furthermore, our method allowed us to identify ad- 
ditional mutations in the non-coding regulatory sequences 
of all three genes that can now be studied for their role in 
hindlimb development. While there are a variety of meth- 
ods available for targeted sequencing and CNV analysis, 
MDiGS offers a valuable cost-effective alternative for com- 
prehensive screening of genomic loci. 
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